Why Bother Learning About Different Optimizers?
Formulating Neural Networks involves the critical step of optimizing the weights that make up the model. To do so, an appropriate loss function is used to measure the error between the predictions produced by the current set of weights and the correct classifications of the observations. This loss function is then minimized to obtain the optimal set of weights.
There are many ways to minimize a function. Traditionally, in a non-machine-learning context, we would set the partial derivatives to zero and solve for the minima. However, this method only works consistently for convex functions; more involved techniques are required to find the minima of non-convex functions. Thus, this method is not a one-size-fits-all solution.
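As a brief illustration of the calculus-based approach on a convex function (a toy example of ours, not taken from the dataset used later), consider the loss \(L(\omega) = (\omega - 3)^2\). Setting its derivative to zero gives the minimum directly:
\[ \frac{dL}{d\omega} = 2(\omega - 3) = 0 \quad \Longrightarrow \quad \omega = 3 \]
For a non-convex loss, a point where the derivative is zero could instead be a local maximum or a saddle point, which is why this approach does not carry over directly.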
In the machine learning context, a more robust approach is to use iterative algorithms such as gradient descent. These algorithms are collectively known as optimizers, and several are in common use. Of these, the most widely preferred and used optimizer is Adam. In this tutorial, we will explore how the Adam optimizer works.
To understand Adam, we first have to understand some of the other optimizers.
Outline:
- Theory Behind Each Optimizer
- Application in R: Character Recognition of Digits in the MNIST Dataset
- Test Results & Conclusions
- Thank you!
- Sources
1. Theory Behind Each Optimizer
1.1 Gradient Descent Algorithm
The gradient descent algorithm should be familiar to the majority of students in the course as it was presented in week 3. For each dimension, the algorithm updates the weights according to the following formula.
a) Calculating and updating weights of Neural Network:
- \(\omega_i\): weight in dimension \(i\)
- \(\alpha\): learning rate
- \(L\): loss function
\[ \begin{align*} \omega_1 &\leftarrow \omega_1 - \alpha \frac{dL}{d\omega_1} \\ \omega_2 &\leftarrow \omega_2 - \alpha \frac{dL}{d\omega_2} \end{align*} \]
The weights are always updated in the direction opposite to the gradient: if the gradient is negative, the weight is increased, and if it is positive, the weight is decreased. The size of the step taken to update the weights is determined by the learning rate.
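To make the update rule concrete, here is a minimal R sketch of gradient descent on a toy quadratic loss with two weights. The loss function, learning rate, and starting point are all illustrative choices of ours, not part of the MNIST model built later.

```r
# Toy quadratic loss: L(w1, w2) = w1^2 + 10 * w2^2 (illustrative only)
grad <- function(w) c(2 * w[1], 20 * w[2])  # partial derivatives dL/dw1 and dL/dw2

alpha <- 0.02      # learning rate
w     <- c(4, 4)   # arbitrary starting weights
for (t in 1:200) {
  w <- w - alpha * grad(w)  # step against the gradient in every dimension
}
w  # both weights have moved close to the minimum at (0, 0)
```

Note that the same learning rate \(\alpha\) is applied in every dimension; this is precisely the limitation discussed next.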
Limitations of Gradient Descent:
- The standard Gradient Descent algorithm is an optimizer which is not adaptive. This means that the learning rate hyperparameter has to be set before the model is trained; if the model does not yield good results, the learning rate has to be altered in a new formulation of the Neural Network and the model trained again until satisfactory results are obtained. On a large dataset, this trial-and-error process can make training very time-consuming. Adaptive optimizers were therefore created, in which the effective step size changes across iterations while the model is being trained.
- Additionally, the Gradient Descent algorithm can take an unnecessarily long time to converge. This is because the algorithm takes an equally large step in every dimension, which can cause the search to oscillate around the search space instead of heading directly towards the minimum.
1.2 Gradient Descent with Momentum Algorithm
An adaptation of the previous algorithm is the Gradient Descent with Momentum algorithm. Momentum refers to the inclusion of an additional hyperparameter, \(\beta\), which controls how much of the past gradients is included when updating the Neural Network weights; the result is a weighted average of gradients. The addition of momentum is intended to accelerate the search in the relevant direction and dampen oscillations within the search space. Subsequently, the weights are updated based on the product of the weighted averages and the learning rate.
a) Calculating and updating weighted averages:
- The following equations show how the weighted average of gradients in each dimension is updated in the Gradient Descent with Momentum algorithm
- \(V_{d\omega_i}\): the weighted average of gradients for the Gradient Descent with Momentum algorithm
- \(\beta_1\): parameter in [0,1] that determines the ratio between past gradients and new gradients for the Gradient Descent with Momentum algorithm
\[ \begin{align*} V_{d\omega_1} &\leftarrow \beta_1 V_{d\omega_1} + (1-\beta_1)\frac{dL}{d\omega_1} \\ V_{d\omega_2} &\leftarrow \beta_1 V_{d\omega_2} + (1-\beta_1)\frac{dL}{d\omega_2} \end{align*} \]
b) Calculating and updating weights of Neural Network:
- The following set of equations shows how the weighted average of gradients in each dimension is used, together with the learning rate \(\alpha\), to update the weights.
- \(\alpha\): learning rate
\[ \begin{align*} \omega_1 &\leftarrow \omega_1 - \alpha V_{d\omega_1} \\ \omega_2 &\leftarrow \omega_2 - \alpha V_{d\omega_2} \end{align*} \]
In the dimension that does not point towards the local minimum, the weighted average of the gradients will be extremely small, as it averages positive and negative numbers that largely cancel out. However, in the dimension towards the minimum, the weighted average of the gradients will be a large negative number, as it averages numbers of the same sign. The image below, depicting a convex bowl function, should help with understanding this concept. Effectively, this makes the step size adaptive: faster in the dimension towards the minimum and slower in the other.
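A minimal R sketch of the momentum update on the same toy quadratic loss used earlier (again, the loss and hyperparameter values are illustrative assumptions of ours):

```r
# Gradient Descent with Momentum on the toy loss L(w1, w2) = w1^2 + 10 * w2^2
grad <- function(w) c(2 * w[1], 20 * w[2])

alpha <- 0.02
beta1 <- 0.9       # weight given to past gradients
w <- c(4, 4)
V <- c(0, 0)       # weighted average of gradients, initialised to 0
for (t in 1:200) {
  V <- beta1 * V + (1 - beta1) * grad(w)  # update the running weighted average
  w <- w - alpha * V                      # step using the averaged gradient
}
w  # both weights approach the minimum at (0, 0)
```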
1.3 Root Mean Square Propagation (RMSProp)
The RMSProp algorithm uses a slightly different method from the Gradient Descent with Momentum algorithm to calculate the weighted average of gradients and subsequently update the weights of the Neural Network. However, the intended outcome is similar: the step size becomes adaptive, faster in the dimension towards the minimum and slower in the other.
a) Calculating and updating weighted averages
- The following equations show how the weighted average of gradients in each dimension is updated in the RMSProp algorithm
- \(S_{d\omega_i}\): weighted average of squared gradients for the RMSProp algorithm
- \(\beta_2\): parameter in [0,1] that determines the ratio between past gradients and new gradients for the RMSProp algorithm
\[ \begin{align*} S_{d\omega_1} &\leftarrow \beta_2 S_{d\omega_1} + (1-\beta_2)(\frac{dL}{d\omega_1})^2 \\ S_{d\omega_2} &\leftarrow \beta_2 S_{d\omega_2} + (1-\beta_2)(\frac{dL}{d\omega_2})^2 \\ \end{align*} \] Note that the first difference between the RMSProp and Gradient Descent with Momentum optimizers is that the gradient used in calculating the weighted averages is squared.
b) Calculating and updating weights of Neural Network:
- The second set of equations shows how the weighted average of gradients in each dimension is used, together with the learning rate \(\alpha\), to update the weights:
- \(\epsilon\): stabilising constant that prevents the update from exploding when the denominator is extremely small. It typically takes a value of \(10^{-8}\)
\[ \begin{align*} \omega_1 &\leftarrow \omega_1 - \alpha \frac{(\frac{dL}{d\omega_1})}{\sqrt{S_{d\omega_1}} + \epsilon} \\ \omega_2 &\leftarrow \omega_2 - \alpha \frac{(\frac{dL}{d\omega_2})}{\sqrt{S_{d\omega_2}} + \epsilon} \end{align*} \] Here, the second difference between the RMSProp and Gradient Descent with Momentum optimizers lies in how the weights of the Neural Network are updated. The two algorithms use different update rules, each intended to make convergence occur faster.
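The corresponding RMSProp update on the same toy loss, again as an illustrative R sketch with hyperparameter values chosen by us:

```r
# RMSProp on the toy loss L(w1, w2) = w1^2 + 10 * w2^2
grad <- function(w) c(2 * w[1], 20 * w[2])

alpha <- 0.02
beta2 <- 0.999
eps   <- 1e-8      # stabilising constant
w <- c(4, 4)
S <- c(0, 0)       # weighted average of squared gradients
for (t in 1:200) {
  g <- grad(w)
  S <- beta2 * S + (1 - beta2) * g^2    # accumulate squared gradients
  w <- w - alpha * g / (sqrt(S) + eps)  # per-dimension rescaled step
}
w  # both weights move towards the minimum at (0, 0)
```

Because each dimension is divided by the square root of its own running average of squared gradients, the initial steps in \(\omega_1\) and \(\omega_2\) are of similar size even though their gradients differ by a factor of ten.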
1.4 Adaptive Moment Estimation (Adam)
The Adam algorithm is a combination of the Gradient Descent with Momentum and RMSProp algorithms. Note that the weighted averages of gradients in the Gradient Descent with Momentum and RMSProp algorithms are denoted by \(V_{d\omega_i}\) and \(S_{d\omega_i}\) respectively, while the ratios between past and new gradients are denoted by \(\beta_1\) and \(\beta_2\) respectively. This is intentional, as the Adam algorithm uses both methods of calculating the weighted averages of gradients.
a) Calculating RMSProp and Gradient Descent with Momentum weighted averages:
- \(V_{d\omega_i}\): the weighted average of gradients for the Gradient Descent with Momentum algorithm
- \(S_{d\omega_i}\): weighted average of squared gradients for the RMSProp algorithm
- \(\beta_1\): parameter in [0,1] that determines the ratio between past gradients and new gradients for the Gradient Descent with Momentum algorithm
- \(\beta_2\): parameter in [0,1] that determines the ratio between past gradients and new gradients for the RMSProp algorithm
\[ \begin{align*} V_{d\omega_1} &\leftarrow \beta_1 V_{d\omega_1} + (1-\beta_1)\frac{dL}{d\omega_1} \\ V_{d\omega_2} &\leftarrow \beta_1 V_{d\omega_2} + (1-\beta_1)\frac{dL}{d\omega_2} \\ S_{d\omega_1} &\leftarrow \beta_2 S_{d\omega_1} + (1-\beta_2)(\frac{dL}{d\omega_1})^2 \\ S_{d\omega_2} &\leftarrow \beta_2 S_{d\omega_2} + (1-\beta_2)(\frac{dL}{d\omega_2})^2 \\ \end{align*} \]
However, the terms \(V_{d\omega_i}\) and \(S_{d\omega_i}\) are all initialized to 0, and the terms \(\beta_1\) and \(\beta_2\) are set close to 1. As a result, the weighted averages are biased towards 0 in the first few updates. The Adam algorithm therefore applies a bias correction in each iteration, \(t\), to overcome this initial bias towards 0. This is an additional feature of the Adam algorithm that the algorithms in sections 1.1 to 1.3 do not implement.
b) Bias correction of weighted averages:
- \(V_{d\omega_i}^{\text{corrected}}\): bias-corrected weighted average of gradients for the Gradient Descent with Momentum algorithm
- \(S_{d\omega_i}^{\text{corrected}}\): bias-corrected weighted average of squared gradients for the RMSProp algorithm
- \(t\): iteration number
\[ \begin{align*} V_{d\omega_1}^{\text{corrected}} &= \frac{V_{d\omega_1}}{1-\beta_1^t} \\ V_{d\omega_2}^{\text{corrected}} &= \frac{V_{d\omega_2}}{1-\beta_1^t} \\ S_{d\omega_1}^{\text{corrected}} &= \frac{S_{d\omega_1}}{1-\beta_2^t} \\ S_{d\omega_2}^{\text{corrected}} &= \frac{S_{d\omega_2}}{1-\beta_2^t} \end{align*} \]
c) Calculating and updating weights of Neural Network:
- \(\epsilon\): stabilising constant that prevents the update from exploding when the denominator is extremely small. It typically takes a value of \(10^{-8}\)
- \(\alpha\): learning rate
\[ \begin{align*} \omega_1 &\leftarrow \omega_1 - \alpha \frac{V_{d\omega_1}^{\text{corrected}}}{\sqrt{S_{d\omega_1}^{\text{corrected}}} + \epsilon} \\ \omega_2&\leftarrow \omega_2 - \alpha \frac{V_{d\omega_2}^{\text{corrected}}}{\sqrt{S_{d\omega_2}^{\text{corrected}}} + \epsilon} \end{align*} \]
As the prior equations show, the Adam algorithm combines the Gradient Descent with Momentum and RMSProp calculations of weighted averages of gradients and additionally applies bias correction to each of these terms. This is intended to make the Adam algorithm converge even faster, since the starting point for learning is not biased towards 0. For these reasons, researchers theorized the Adam algorithm to be the best adaptive optimizer.
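Putting the pieces together, here is a minimal R sketch of the full Adam update on the same toy quadratic loss. The \(\beta\) values are the commonly cited defaults, while the loss, \(\alpha\), and the number of iterations are illustrative choices of ours.

```r
# Adam on the toy loss L(w1, w2) = w1^2 + 10 * w2^2:
# momentum-style and RMSProp-style averages plus bias correction
grad <- function(w) c(2 * w[1], 20 * w[2])

alpha <- 0.05
beta1 <- 0.9
beta2 <- 0.999
eps   <- 1e-8
w <- c(4, 4)
V <- c(0, 0)   # weighted average of gradients (momentum part)
S <- c(0, 0)   # weighted average of squared gradients (RMSProp part)
for (t in 1:500) {
  g <- grad(w)
  V <- beta1 * V + (1 - beta1) * g
  S <- beta2 * S + (1 - beta2) * g^2
  V_hat <- V / (1 - beta1^t)                     # bias correction: undo the pull towards 0
  S_hat <- S / (1 - beta2^t)
  w <- w - alpha * V_hat / (sqrt(S_hat) + eps)   # combined, bias-corrected update
}
w  # both weights end up near the minimum at (0, 0)
```

In the application section we will of course rely on a library implementation of these optimizers rather than this hand-rolled loop; the sketch is only meant to make the equations above concrete.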
To emphasize that these algorithms are indeed different, we found an interesting animation which shows the different paths taken by different optimizers in reaching a minimum point. Here, we note that different algorithms may find different minima, and the time taken to reach (or converge towards) these minima varies greatly from one optimizer to another. These illustrations are purely hypothetical, however, so we can proceed to conduct some tests to see whether a similar phenomenon occurs in practice.